To submit homework, please submit Rmd and html files to bruinlearn by the deadline.
1 Q1. Review of linear models (60pts)
The swiss data — use Fertility as the response to practice.
1.1 Q1.1
An initial data analysis that explores the numerical and graphical characteristics of the data. (5pts)
Solution According to online sources, the Swiss Fertility and Socioeconomic Indicators dataset represents standardized fertility measures and socioeconomic indicators for each of the 47 French-speaking provinces of Switzerland in about 1888.
The swiss data set is a data frame with 47 observations of 6 variables: Fertility (numeric), Agriculture (numeric), Examination (integer), Education (integer), Catholic (numeric), and Infant.Mortality (numeric). The variables are as follows: 1. Fertility (common standardized fertility measure, Ig), 2. Agriculture (percentage of males involved in agriculture as an occupation), 3. Examination (percentage of draftees receiving the highest mark on the army examination), 4. Education (percentage of draftees with education beyond primary school), 5. Catholic (percentage of Catholics as opposed to Protestants), 6. Infant.Mortality (rate of live births who live less than 1 year).
The summary statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) of the variables in the swiss dataset are as follows. We are able to obtain numerical summaries of all these variables because they are all quantitative.
summary(swiss)
Provinces Fertility Agriculture Examination
Length:47 Min. :35.00 Min. : 1.20 Min. : 3.00
Class :character 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00
Mode :character Median :70.40 Median :54.10 Median :16.00
Mean :70.14 Mean :50.66 Mean :16.49
3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00
Max. :92.50 Max. :89.70 Max. :37.00
Education Catholic Infant.Mortality
Min. : 1.00 Min. : 2.150 Min. :10.80
1st Qu.: 6.00 1st Qu.: 5.195 1st Qu.:18.15
Median : 8.00 Median : 15.140 Median :20.00
Mean :10.98 Mean : 41.144 Mean :19.94
3rd Qu.:12.00 3rd Qu.: 93.125 3rd Qu.:21.70
Max. :53.00 Max. :100.000 Max. :26.60
Data Visualization Plots: Univariate
Univariate plot of the Fertility variable (response variable): The distribution appears to be slightly right skewed (positively skewed), where most provinces have moderate fertility rates and a few have relatively high ones. There is a single peak, making the distribution approximately unimodal, with the most common fertility rates between 70-75. Some provinces have unusually low or high fertility rates (around 35 or 95).
ggplot(swiss, aes(x = Fertility)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "skyblue",
                 color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(title = "Histogram of Fertility Rates in Swiss Provinces",
       x = "Fertility Rate", y = "Density") +
  theme_minimal()
Boxplots for the Predictor Variables:
The boxplots show the distribution of the predictor variables (Agriculture, Catholic, Education, Examination, and Infant Mortality) in the swiss dataset. Brief summary of the boxplots: 1. Agriculture: The median percentage of males involved in agriculture as an occupation lies between 50-60 percent, the distribution is moderately spread out, and there are no clear outliers. 2. Catholic: The distribution for the percent of Catholics as opposed to Protestants is severely right skewed (positively skewed), the spread is extremely wide, and there are no visible outliers. This indicates that most of the provinces have a small percentage of Catholics. 3. Education: The percentage of education beyond primary school for draftees is slightly right skewed (positively skewed), the median is low (between 0 and 10), and there are a few notable outliers (greater than 30), but the spread of the percentages is narrow. 4. Examination: The distribution for the percentage of draftees receiving the highest marks on army examinations is slightly right skewed (positively skewed) with no strong outliers, and the median is near the lower end of the IQR. 5. Infant Mortality: The distribution of live births who live less than 1 year is fairly symmetrical with a slightly longer upper whisker, the range is narrow, and the median is around 20.
predictor_vars <- c("Agriculture", "Examination", "Education", "Catholic", "Infant.Mortality")
swiss_long <- swiss %>%
  pivot_longer(cols = all_of(predictor_vars), names_to = "Variable", values_to = "Value")
ggplot(swiss_long, aes(x = "", y = Value)) +
  geom_boxplot(fill = "lightgreen") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Boxplots of Predictor Variables", x = "", y = "") +
  theme_minimal()
Density plots for all variables including Fertility
For better visualization of the distributions of all the predictor variables and the response variable, Fertility, we plot the density plots. The distributions reflect the same characteristics as the histograms above.
We will create several scatter plots between pairs of variables because all 6 variables are continuous. Description of the scatter plots: 1. Fertility versus Agriculture: The trend below shows that higher agricultural activity correlates with higher fertility rates (positive relationship). This aligns with historical patterns where economies primarily based on agriculture often have larger families due to labor needs and lower urbanization. 2. Fertility versus Education: Higher education levels often correlate with lower fertility rates. 3. Fertility versus Infant Mortality: The higher the infant mortality rate, the higher the fertility (parents have more children to offset expected losses). 4. Fertility versus Examination: Fertility may decline as the percentage of draftees receiving the highest marks on the army examination increases. 5. Fertility versus Catholic: Regions with higher Catholic populations may have higher fertility due to religious norms.
plot_list <- list()
predictors <- c("Agriculture", "Examination", "Education", "Catholic", "Infant.Mortality")
for (predictor in predictors) {
  p <- ggplot(swiss, aes(x = .data[[predictor]], y = Fertility)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    ggtitle(paste("Fertility vs", predictor)) +
    theme_minimal()
  plot_list[[predictor]] <- p
}
grid.arrange(grobs = plot_list, ncol = 2)
Data Visualization: Correlation Plots Description: The correlation matrix shows the pairwise comparisons of the predictor variables. We are able to see that education and examination are highly positively correlated (0.70), indicating that higher education levels are associated with better examination outcomes. Education and Agriculture have a strong negative correlation (-0.64), suggesting that regions with higher agricultural activity tend to have lower education. The predictor variables Catholic and Fertility are negatively correlated (-0.60). Infant mortality shows weaker correlations with the other variables.
cor_matrix <- cor(swiss[, -1])
corrplot(cor_matrix, method = "number", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black")
1.2 Q1.2
Variable selection to choose the best model. (10pts)
Solution We used the stepwise selection method to identify the best linear model for predicting Fertility based on the predictors: Agriculture, Examination, Education, Catholic, and Infant Mortality. The full model included all main effects as well as all possible two-way interactions (e.g., Agriculture:Examination). The step() function was applied to iteratively add and remove terms in order to minimize the Akaike Information Criterion (AIC), using both forward and backward selection.
The resulting best model achieved the lowest AIC score of 317.41 and included the following terms: Main effects: Agriculture, Examination, Education, Catholic, Infant Mortality Interaction terms: Agriculture:Examination, Agriculture:Education, Agriculture:Infant.Mortality, Examination:Education, Examination:Infant.Mortality, and Education:Catholic.
All variables in the model were statistically significant except for the interactions Agriculture:Examination and Agriculture:Education. The model has an R-squared value of 0.811, indicating that approximately 81% of the variance in the Fertility variable is explained by the model. Additionally, the overall model is highly significant, with a p-value of 1.283e-09.
full_model <- lm(Fertility ~ (Agriculture + Examination + Education + Catholic + Infant.Mortality)^2,
                 data = swiss)
best_model <- step(full_model, direction = "both", trace = FALSE)
summary(best_model)
The following code performs single-term deletion diagnostics on the best model obtained earlier. It evaluates how the model’s performance would change if each term were removed individually. The results show that two interaction terms — Agriculture:Examination and Agriculture:Education — have p-values greater than 0.05, indicating they are not statistically significant.
Additionally, the RSS (Residual Sum of Squares) values associated with these terms are lower compared to the others, meaning that removing them results in only a slight increase (or even improvement) in model error. This suggests that these terms do not contribute meaningfully to improving model fit and could potentially be removed to simplify the model without significantly compromising performance.
We will further test whether the non-significant interaction terms can be dropped. Because the p-value is greater than 0.05, we conclude that the two interaction terms Agriculture:Examination and Agriculture:Education can be dropped.
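The single-term deletion described above can be reproduced with drop1(); a minimal sketch, assuming best_model is the stepwise-selected model from the chunk above:

```r
# F-tests for dropping each term individually from the stepwise-selected model
# (assumes `best_model` from the step() chunk above)
drop1(best_model, test = "F")
```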
Refitting the final model after dropping the two nonsignificant interaction terms gives us the following:
\[
\begin{aligned}
\hat{y} = {} & 160.178 - 1.217\,\text{Agriculture} - 3.037\,\text{Examination} - 0.501\,\text{Education} + 0.179\,\text{Catholic} \\
& - 3.462\,\text{Infant.Mortality} + 0.051\,(\text{Agriculture}\times\text{Infant.Mortality}) + 0.004\,(\text{Examination}\times\text{Education}) \\
& + 0.127\,(\text{Examination}\times\text{Infant.Mortality}) - 0.012\,(\text{Education}\times\text{Catholic})
\end{aligned}
\]
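The refit can be sketched with update(); final_best_model is the object name used by the diagnostics chunks later in this document (a sketch, assuming best_model from the stepwise selection chunk):

```r
# Drop the two non-significant interactions and refit
# (assumes `best_model` from the stepwise selection chunk above)
final_best_model <- update(best_model,
                           . ~ . - Agriculture:Examination - Agriculture:Education)
summary(final_best_model)
```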
1.3 Q1.3
An exploration of transformations to improve the fit of the model. (10pts)
Solution
To improve model fit, we search for a better-fitting linear regression model for Fertility in the swiss dataset by testing different transformations (identity, log, inverse, B-splines, and orthogonal polynomials) of the five predictor variables (Agriculture, Examination, Education, Catholic, and Infant.Mortality), choosing each candidate transformation based on the shape of the variable's distribution.
We first impose a log transformation on Agriculture. The density plot of this predictor shows a right-skewed distribution, and a log transformation can help normalize it.
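A sketch of this comparison (the original chunk is not shown), assuming final_best_model is the refit model from Q1.2; the interaction terms are left as selected:

```r
# Swap the Agriculture main effect for log(Agriculture) and compare AIC
# (assumes `final_best_model`, the refit model from Q1.2)
log_ag_model <- update(final_best_model, . ~ . - Agriculture + log(Agriculture))
AIC(final_best_model, log_ag_model)
```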
The log-transformed plot adds curvature and introduces more scatter and a less precise fit in the partial residuals. In addition, we are able to see from the AIC scores of the original model and the log-transformed plot (only Agriculture) that the original model has better overall model fit. Therefore, we will explore other transformations.
Next, we see that the distribution for Examination appears bimodal, so we impose a polynomial transformation to capture its nonlinearity.
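A sketch of the polynomial comparison, again assuming final_best_model as the baseline:

```r
# Orthogonal quadratic polynomial on Examination
# (assumes `final_best_model`, the refit model from Q1.2)
poly_exam_model <- update(final_best_model,
                          . ~ . - Examination + poly(Examination, 2))
AIC(final_best_model, poly_exam_model)
```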
The plot for the polynomial-transformed Examination shows that the curvature is not very strong: the partial residuals appear scattered and the line does not capture a clear pattern. We interpret this as the polynomial term overfitting or adding unnecessary complexity. In addition, the AIC for the model with the transformed Examination predictor is greater than that of the original model. Therefore, although Examination appears bimodal, it is better left untransformed.
Next, we see that the distribution for education is right skewed. Therefore, we will impose a log transformation on Education.
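A sketch of the log-Education comparison, assuming final_best_model as the baseline:

```r
# Log-transform Education (its minimum is 1, so log() is well defined)
# (assumes `final_best_model`, the refit model from Q1.2)
log_edu_model <- update(final_best_model, . ~ . - Education + log(Education))
AIC(final_best_model, log_edu_model)
```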
The termplot for “Log-Transformed Education” shows a nonlinear curve, flattening out as Education increases. The residuals appear more scattered and the trend is less distinct. The curvature does not enhance the interpretability of the model fit, despite the Education predictor being right skewed. The AIC for the model with log-transformed Education is also higher than that of the original model, so the transformation is not necessary.
Next, we see that the distribution for the variable Catholic has two distinct peaks. Therefore, we will impose B-splines on this predictor variable to capture its nonlinearity and model this multi-modal pattern.
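A sketch of the B-spline comparison, assuming final_best_model as the baseline:

```r
# Cubic B-spline basis (df = 3) on Catholic
# (assumes `final_best_model`, the refit model from Q1.2)
library(splines)
bs_cath_model <- update(final_best_model, . ~ . - Catholic + bs(Catholic, df = 3))
AIC(final_best_model, bs_cath_model)
```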
We are able to see that the B-spline (df = 3) introduces flexibility, allowing a nonlinear curve that could better reflect multimodal patterns, especially around the extreme values of the Catholic variable. However, when comparing AIC scores between the original model and the B-spline-transformed model, we see that the original model has a lower score. Therefore, a transformation of the Catholic variable is not necessary.
Lastly, we see that the distribution of the Infant.Mortality variable is roughly normal with a right tail, so we impose a square-root transformation on this predictor to capture its nonlinearity.
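A sketch of the square-root comparison, assuming final_best_model as the baseline:

```r
# Square-root transformation on Infant.Mortality
# (assumes `final_best_model`, the refit model from Q1.2)
sqrt_im_model <- update(final_best_model,
                        . ~ . - Infant.Mortality + sqrt(Infant.Mortality))
AIC(final_best_model, sqrt_im_model)
```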
The termplot for the square root transformed variable for Infant Mortality introduces nonlinearity, curving upward slightly. We are able to see that the partial residuals are more dispersed and less tightly aligned to the trend. The visual signal appears weaker and less interpretable. In addition, the AIC score for the square root transformed model is greater than that of the original model. Therefore, we can conclude that the square root transformation is not necessary.
After analyzing the effects of these transformations, we conclude that the best final model is the untransformed model from Q1.2 (final_best_model): none of the candidate transformations lowered the AIC.
1.4 Q1.4
Diagnostics to check the assumptions of your model. (10pts)
Solution 1. Residuals versus Fitted Values: The residuals appear randomly scattered around 0 (horizontal line), suggesting no severe non-linearity. We are able to see that there is a slight funnel shape (wider spread at higher fitted values) which may indicate mild heteroscedasticity (non-constant variance).
Scale Location (Spread-Location): We will now assess the homoscedasticity (constant variance of residuals). We are able to see that the points are roughly horizontal with minor fluctuations, which suggests moderate homoscedasticity.
Q-Q Residuals: We will test the normality of residuals. In the graph, the points mostly follow the dashed 45 degree line, indicating approximate normality. Minor deviations in the tails are not severe enough to be a problem.
Residuals versus Leverage: We use this plot to identify influential outliers (high leverage combined with large residuals). No points fall outside the Cook's distance contours (red dashed lines), suggesting that there are no highly influential outliers. A few points have high leverage, but their modest residuals pose little risk.
par(mfrow = c(2, 2))
plot(final_best_model)
par(mfrow = c(1, 1))
The results of the Shapiro Wilk normality test are shown below. The p-value is 0.2238. Therefore, we fail to reject the null hypothesis that the residuals are normally distributed. Therefore, this suggests that the model’s residuals do not significantly deviate from normality.
shapiro.test(residuals(final_best_model))
Shapiro-Wilk normality test
data: residuals(final_best_model)
W = 0.96809, p-value = 0.2238
Lastly, we will run the studentized Breusch-Pagan test to check for heteroscedasticity (non-constant variance of residuals) in the linear regression model. Because the residual plots above hinted at heteroscedasticity, we run this test to confirm whether the residuals have constant variance. The p-value resulting from the Breusch-Pagan test is 0.2305, so we fail to reject the null hypothesis. Therefore, we conclude that the residuals have constant variance.
bptest(final_best_model)
studentized Breusch-Pagan test
data: final_best_model
BP = 11.704, df = 9, p-value = 0.2305
1.5 Q1.5
Some predictions of future observations for interesting values of the predictors. (5pts)
Solution We first created a data frame containing various combinations of the predictor variables. Specifically, we used values of 10, 50, and 90 for Agriculture; a sequence of scores from 3 to 30 (by 5) for Examination; values of 5, 15, and 25 for Education; values of 10, 50, and 90 for Catholic; and values of 10, 20, and 30 for Infant.Mortality. Using this grid, we applied the final_best_model to generate predicted Fertility values for each combination, along with corresponding 95% confidence intervals for the mean predictions. The resulting data frame is displayed below.
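A sketch of the prediction grid described above, assuming final_best_model is the refit model from Q1.2:

```r
# Grid of interesting predictor values with 95% CIs for the mean response
# (assumes `final_best_model`, the refit model from Q1.2)
new_data <- expand.grid(
  Agriculture      = c(10, 50, 90),
  Examination      = seq(3, 30, by = 5),
  Education        = c(5, 15, 25),
  Catholic         = c(10, 50, 90),
  Infant.Mortality = c(10, 20, 30)
)
preds <- predict(final_best_model, newdata = new_data,
                 interval = "confidence", level = 0.95)
head(cbind(new_data, preds))
```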
1.6 Q1.6
An interpretation of the meaning of the model by writing a scientific abstract. (<150 words) (10pts)
BACKGROUND: brief intro of the study background, what are the existing findings: In the late 19th century, Switzerland underwent a demographic transition marked by declining fertility. Understanding how cultural, religious, and economic factors influenced fertility offers insights into population change and informs modern demographic strategies.
OBJECTIVE: state the overall purpose of your research, e.g., what kind of knowledge gap you are trying to fill in:
This study examines how socioeconomic and cultural variables—Agriculture, Examination, Education, Catholicism, and Infant Mortality—are associated with fertility rates.
METHODS: study design (how these data were collected), outcome definitions, statistical procedures used
Using the swiss dataset (47 French-speaking provinces, 1888), we performed exploratory analysis, built a full linear model with two-way interactions, applied stepwise AIC selection, and evaluated predictor transformations (log, polynomial, B-splines). Model assumptions were checked via residual plots, Shapiro-Wilk, and Breusch-Pagan tests.
RESULTS: summary of major findings to address the question raised in objective
The final model (Adjusted R² = 0.72; AIC = 321.45) retained predictors in original scale, including four key interaction terms. Transformations did not improve fit.
CONCLUSIONS: Multivariable modeling reveals that fertility is influenced by complex interactions among education, religion, and health—offering insight for demographic and policy planning.
2 Q2.(70pts)
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the dataset pima.
pregnant glucose diastolic triceps
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
insulin bmi diabetes age
Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
test
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.349
3rd Qu.:1.000
Max. :1.000
2.1 Q2.1
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot? (5pts)
# Factor version of the test result, per the question prompt
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))
ggplot(pima, aes(x = insulin, fill = test)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(title = "Insulin Levels by Test Result",
       x = "Insulin", y = "Count", fill = "Test Result") +
  theme_minimal()
Abnormalities of the graph: The histogram displays a large concentration of insulin values at 0, which is highly improbable for human insulin levels. This anomaly likely indicates missing or unrecorded measurements that were incorrectly entered as zero, rather than representing true physiological data. Additionally, the graph reveals the presence of extreme outliers, with some insulin values exceeding 750. Such values are not biologically plausible and may be the result of data entry errors or measurement inaccuracies.
2.2 Q2.2
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution. (5pts)
# Replace the zero insulin values with NA, then replot
pima$insulin_clean <- replace(pima$insulin, pima$insulin == 0, NA)
ggplot(pima, aes(x = insulin_clean, fill = test)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.7, na.rm = TRUE) +
  labs(title = "Distribution of Insulin Levels by Diabetes Test Result (Excluding Zeros)",
       x = "Insulin", y = "Count", fill = "Test Result") +
  theme_minimal()
Creating the interleaved histogram plot: Among the Pima Indians who tested negative (pink), the majority of individuals have insulin levels clustered between 0-150, with a peak at around 100-125. The distribution of insulin levels is right skewed, with fewer people at higher insulin levels and a couple of outliers above 300. Among the Pima Indians who tested positive (blue), the majority of individuals have insulin levels clustered between 100-250. The distribution is more spread out compared to that of the negative group and has a longer right tail, which suggests that some individuals who tested positive have very high insulin levels (above 500). The spike at insulin = 0 seen in the previous graph is gone because those 0 values were replaced with NAs, which are omitted from the plot. The histogram therefore reflects only a fraction of the data, which may lead to biased interpretations.
2.3 Q2.3
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame? (10pts)
Solution We observe from the output of summary(pima) that several predictor variables have a minimum value of 0, specifically: glucose, diastolic, triceps, insulin, and bmi. These values are medically implausible and likely represent missing data, so we will convert them to NA. However, we will not convert the 0 values in the pregnant variable, as it is perfectly reasonable for individuals in the dataset to have had zero pregnancies, so a 0 there is not indicative of missing data.
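A sketch of the zero-to-NA recoding; pima_clean and vars_to_clean are the names used by later chunks, and pima is assumed already loaded (e.g., from the faraway package):

```r
# Recode medically implausible zeros as NA (pregnant is deliberately left alone)
vars_to_clean <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_clean <- pima
pima_clean[vars_to_clean] <- lapply(pima_clean[vars_to_clean],
                                    function(x) replace(x, x == 0, NA))
```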
Call:
glm(formula = test ~ pregnant + glucose + diastolic + triceps +
insulin + bmi + diabetes + age, family = binomial, data = pima_clean)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
pregnant 8.216e-02 5.543e-02 1.482 0.13825
glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
diastolic -1.420e-03 1.183e-02 -0.120 0.90446
triceps 1.122e-02 1.708e-02 0.657 0.51128
insulin -8.253e-04 1.306e-03 -0.632 0.52757
bmi 7.054e-02 2.734e-02 2.580 0.00989 **
diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
age 3.395e-02 1.838e-02 1.847 0.06474 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 344.02 on 383 degrees of freedom
(376 observations deleted due to missingness)
AIC: 362.02
Number of Fisher Scoring iterations: 5
n_used <- nobs(full_model)
n_total <- nrow(pima_clean)
cat("Observations used in model fitting:", n_used, "\n",
    "Total observations in data frame:", n_total, "\n",
    "Observations omitted due to missing values:", n_total - n_used, "\n")
Observations used in model fitting: 392
Total observations in data frame: 768
Observations omitted due to missing values: 376
The number of observations used in the model fitting is 392. This is fewer than the 768 observations in the original data frame because glm() uses only rows with no missing values in any model variable (rows containing NAs are excluded). The invalid zero values in glucose, diastolic, triceps, insulin, and bmi are not physiologically realistic; they were replaced with NA, and the rows containing them were excluded during model fitting to ensure data quality. Note that 376 observations, a substantial portion of the dataset, were omitted due to missing values. Complete-case analysis on this scale can bias interpretation, so imputation methods are worth considering in later analyses.
2.4 Q2.4
Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question. (10pts)
Call:
glm(formula = test ~ pregnant + glucose + diastolic + bmi + diabetes +
age, family = binomial(link = "logit"), data = pima_clean)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.962146 0.820892 -10.918 < 2e-16 ***
pregnant 0.117863 0.033418 3.527 0.00042 ***
glucose 0.035194 0.003605 9.763 < 2e-16 ***
diastolic -0.008916 0.008618 -1.035 0.30084
bmi 0.090926 0.015740 5.777 7.61e-09 ***
diabetes 0.960515 0.306415 3.135 0.00172 **
age 0.016944 0.009834 1.723 0.08489 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 931.94 on 723 degrees of freedom
Residual deviance: 672.86 on 717 degrees of freedom
(44 observations deleted due to missingness)
AIC: 686.86
Number of Fisher Scoring iterations: 5
n_used_reduced <- nobs(model_reduced)
cat("Observations used in reduced model that removed insulin and triceps:", n_used_reduced, "\n")
Observations used in reduced model that removed insulin and triceps: 724
cat("Observations used in Full Model:", n_used, "\n")
Observations used in Full Model: 392
cat("Total observations in data frame:", n_total, "\n")
Total observations in data frame: 768
We refit the model without the insulin and triceps predictors to create a reduced model. This reduced model used 724 observations, compared with 392 for the full model, because rows that were missing only insulin or triceps are no longer dropped.
Likelihood ratio test
Model 1: test ~ pregnant + glucose + diastolic + triceps + insulin + bmi +
diabetes + age
Model 2: test ~ pregnant + glucose + diastolic + bmi + diabetes + age
#Df LogLik Df Chisq Pr(>Chisq)
1 9 -172.01
2 7 -172.44 -2 0.8593 0.6507
Test to compare this model to the previous model (which included insulin and triceps): Because the reduced and full models were fit on different numbers of observations, we use the update_nested function to refit the reduced model on the exact same 392 observations as the full model, which makes the comparison valid. We then use a likelihood ratio test, which compares the goodness of fit of two nested models and checks whether removing insulin and triceps significantly worsens the fit; the null hypothesis is that the reduced model fits the data as well as the full model. The p-value is 0.6507 (greater than 0.05), so we fail to reject the null hypothesis: insulin and triceps do not significantly contribute to the model once the other predictors are included, and the simpler reduced model is preferred.
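One way to sketch the same-rows comparison: refit both models on the full model's 392 complete cases, then run the likelihood ratio test via base R's anova() (equivalent to lrtest for nested glms); pima_clean is assumed to be the zero-recoded data:

```r
# Refit full and reduced models on the same complete cases, then LRT
pima_cc    <- na.omit(pima_clean)  # the 392 complete cases
full_cc    <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin +
                    bmi + diabetes + age, family = binomial, data = pima_cc)
reduced_cc <- update(full_cc, . ~ . - insulin - triceps)
anova(reduced_cc, full_cc, test = "Chisq")  # likelihood ratio test
```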
2.5 Q2.5
Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model? (10pts)
Solution
We perform stepwise selection, adding or dropping predictors to minimize the AIC. We start with the full model containing all the predictors (pregnant, glucose, diastolic, triceps, insulin, bmi, diabetes, and age) after removing all rows that contain NA values. AIC balances goodness of fit (how well the model explains the data) against model complexity (penalizing the number of predictors); the lower the AIC, the better the model. Starting from the full model, we iteratively add or remove predictors (both forward selection and backward elimination).
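A sketch of the selection step, assuming pima_clean is the zero-recoded data; selected_model is the name used in the output below:

```r
# Stepwise AIC selection (both directions) on the complete-case data
pima_cc <- na.omit(pima_clean)
full_cc <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin +
                 bmi + diabetes + age, family = binomial, data = pima_cc)
selected_model <- step(full_cc, direction = "both", trace = FALSE)
formula(selected_model)
```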
Selected predictors: pregnant glucose bmi diabetes age
cat("Number of cases used in selected model:", nobs(selected_model), "\n")
Number of cases used in selected model: 392
cat("AIC of selected model:", AIC(selected_model), "\n")
AIC of selected model: 356.8851
The model with the lowest AIC value comes out to be the one with the following variables: pregnant + glucose + bmi + diabetes + age. The selected model used 5 predictor variables out of 8. The number of cases used in the selected model is 392 (utilized the dataset that removed all the rows containing the NA values for each predictor variable.) The AIC value of the selected model is 356.8851. The selected model can be characterized as: \[y = -9.992 + 0.084X_1 + 0.036X_2 + 0.078X_3 + 1.151X_4 + 0.034X_5\]
2.6 Q2.6
Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this. (10pts)
Solution
First, we create a missingness indicator where 1 means that the row has a missing value (NA) and 0 means that the row does not have any missing values. We indicate this missingness indicator as has_missing. Note that we utilize the pima_clean dataset which is the data after all the 0 values are converted into N/A values. We then use the has_missing variable as a predictor of the test result and we notice that the p-value for the variable is 0.304. This signifies that the cases with missing values are not significantly more likely to test positive for diabetes than complete cases. In other words, we don’t have evidence to conclude that the presence of missing data is associated with the test outcome.
pima_clean$has_missing <- ifelse(rowSums(is.na(pima_clean[vars_to_clean])) > 0, 1, 0)
missing_test <- glm(test ~ has_missing, family = binomial, data = pima_clean)
summary(missing_test)
Call:
glm(formula = test ~ has_missing, family = binomial, data = pima_clean)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7008 0.1073 -6.533 6.47e-11 ***
has_missing 0.1558 0.1515 1.028 0.304
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 992.43 on 766 degrees of freedom
AIC: 996.43
Number of Fisher Scoring iterations: 4
Next, we will refit the selected model using as much of the data as reasonable. Since missingness is not significantly associated with the outcome, it is reasonable to assume the data are missing at random. Therefore, instead of removing rows with missing values (which reduces sample size and power), we can impute the missing values with the mean/median or use other imputation methods such as MICE, keeping as many observations as possible where the variables required for the model are present. This retains more data, increasing the statistical power of the final model, and avoids biases introduced by analyzing only complete cases, especially since missingness isn't strongly related to the "test" response outcome.
The next chunk of code performs multiple imputation to handle missing data and then analyzes the imputed datasets. We first create 5 imputed datasets using the MICE (Multivariate Imputation by Chained Equations) algorithm, which predicts the missing values from the observed data and the relationships among the variables. We then refit the previously selected model (logistic regression with predictors pregnant + glucose + bmi + diabetes + age) on each of the 5 imputed datasets, and the results from the 5 fitted models are combined into a single pooled estimate.
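A minimal sketch of this impute-fit-pool workflow with the mice package (m = 5 matches the 5 imputed datasets mentioned above; the seed value is an arbitrary choice for reproducibility):

```r
library(mice)

# Create 5 imputed copies of the data; printFlag suppresses progress output
imp <- mice(pima_clean, m = 5, seed = 123, printFlag = FALSE)

# Refit the selected logistic regression on each imputed dataset
fits <- with(imp, glm(test ~ pregnant + glucose + bmi + diabetes + age,
                      family = binomial))

# Combine the 5 fitted models into a single pooled estimate (Rubin's rules)
summary(pool(fits))
```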
After refitting the selected model on the imputed datasets, the pooled model (on the log-odds scale) is: \[\log\left(\frac{p}{1-p}\right) = -9.3813 + 0.1230X_1 + 0.0362X_2 + 0.0885X_3 + 0.8760X_4 + 0.0108X_5\] where \(X_1, \dots, X_5\) denote pregnant, glucose, bmi, diabetes, and age, respectively.
2.6 Q2.7
Using the last fitted model of the previous question [model selected in Question 2.5], compute the odds ratio of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this odds ratio. (10pts)
Solution
We use the dataset after accounting for the missing values (i.e., after removing all rows with NA values).
The odds ratio of testing positive for diabetes for a woman with a BMI at the third quartile compared with a woman at the first quartile, holding all other factors constant, is 1.9735. In other words, a woman with a BMI at the third quartile (37.1) has about 1.97 times the odds of testing positive for diabetes compared to a woman with a BMI at the first quartile (28.4), assuming that all other predictors (pregnancies, glucose, diabetes pedigree, and age) are held constant. The 95 percent confidence interval for this odds ratio is (1.39, 2.80): we are 95 percent confident that the true odds ratio for diabetes risk (comparing Q3 versus Q1 BMI) lies between 1.39 and 2.80 in the population.
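A sketch of this calculation, assuming the model selected in Q2.5 is stored in an object we call final_model (a hypothetical name) fitted on the complete-case data pima_complete:

```r
# First and third quartiles of BMI
q <- quantile(pima_complete$bmi, c(0.25, 0.75))
delta <- unname(q[2] - q[1])  # Q3 - Q1 difference in BMI

# Coefficient and standard error for bmi from the fitted logistic model
beta <- coef(final_model)["bmi"]
se <- sqrt(vcov(final_model)["bmi", "bmi"])

# Odds ratio for a Q3 vs Q1 BMI, holding other predictors constant,
# with a 95% Wald confidence interval on the same contrast
or <- exp(beta * delta)
ci <- exp((beta + c(-1, 1) * 1.96 * se) * delta)
```

Because the log-odds are linear in BMI, the odds ratio depends only on the difference Q3 − Q1, not on the individual quartile values.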
2.7 Q2.8
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory. (10pts)
Solution In order to check if women who test positive have higher diastolic blood pressures on average, we will compare the mean/median values of the diastolic blood pressures between the two testing groups (those who tested positive versus negative) by using a t-test or the Wilcoxon test. We will first check if the distributions of the Diastolic blood pressures within each group (Negative/Positive Tests) are normally distributed or not.
hist(pima_clean$diastolic[pima_clean$test == "Negative"], main = "Distribution of Diastolic BP (Negative Test)")
hist(pima_clean$diastolic[pima_clean$test == "Positive"], main = "Distribution of Diastolic BP (Positive Test)")
The histograms show that both distributions are approximately normal, so we use the t-test (a parametric test) to compare mean diastolic blood pressure between the two groups (Positive and Negative tests).
t.test(diastolic ~ test, data = pima_clean)
Welch Two Sample t-test
data: diastolic by test
t = -4.6643, df = 504.72, p-value = 3.972e-06
alternative hypothesis: true difference in means between group Negative and group Positive is not equal to 0
95 percent confidence interval:
-6.316023 -2.572156
sample estimates:
mean in group Negative mean in group Positive
70.87734 75.32143
From the results of the Welch two-sample t-test, the difference in mean diastolic blood pressure between the two groups (Positive and Negative) is statistically significant because the p-value (3.97e-06) is smaller than 0.05. The group means are 70.88 for the Negative group and 75.32 for the Positive group. Therefore, we conclude that women who test positive do have higher diastolic blood pressures on average.
Next, we check whether diastolic blood pressure is significant in the regression model. The model used here is the best model found in Q2.5 with diastolic blood pressure added as a predictor. The p-value for diastolic blood pressure is 0.945949, which is far from significant. This means that diastolic blood pressure does not contribute significantly to the regression model once the other variables are included.
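The output below comes from a fit of the following form (the object name model_dbp is our choice, not from the original code):

```r
# Refit the Q2.5 model with diastolic blood pressure added as a predictor
model_dbp <- glm(test ~ pregnant + glucose + bmi + diabetes + age + diastolic,
                 family = binomial, data = pima_complete)
summary(model_dbp)
```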
Call:
glm(formula = test ~ pregnant + glucose + bmi + diabetes + age +
diastolic, family = binomial, data = pima_complete)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.9604796 1.1818764 -8.428 < 2e-16 ***
pregnant 0.0840497 0.0550728 1.526 0.126971
glucose 0.0364863 0.0049973 7.301 2.85e-13 ***
bmi 0.0785728 0.0215674 3.643 0.000269 ***
diabetes 1.1492368 0.4250340 2.704 0.006854 **
age 0.0346079 0.0181919 1.902 0.057121 .
diastolic -0.0008002 0.0118034 -0.068 0.945949
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 344.88 on 385 degrees of freedom
AIC: 358.88
Number of Fisher Scoring iterations: 5
The first question examines a univariate association (diastolic blood pressure versus the test outcome, without accounting for the other predictors), while the second examines a multivariate association (diastolic blood pressure's contribution after accounting for the other predictors). Although the t-test shows that diastolic blood pressure differs between the groups in isolation, the regression model shows that it provides no additional predictive power beyond the other variables in the model. The answers only appear contradictory: the t-test shows that diastolic blood pressure alone is associated with the test outcome, but in the regression it is likely correlated with other predictors, which renders it insignificant. Moreover, even a statistically significant mean difference between the two testing groups need not improve prediction meaningfully in a multivariate model.
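One quick, informal way to check this explanation is to inspect the pairwise correlations between diastolic and the other predictors (a sketch assuming pima_complete holds the complete cases; the chosen column subset is illustrative):

```r
# If diastolic is correlated with predictors already in the model (e.g. bmi, age),
# its marginal association with the outcome can vanish once they are adjusted for
round(cor(pima_complete[, c("diastolic", "bmi", "age", "glucose")]), 2)
```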